Jax: Scanner Generator Examples
This section walks through a couple of toy examples to show how you can generate scanners with jax. If you are familiar with flex or lex, you can just skim through the examples to see the syntax, and jump to the reference.
# Count the number of words in a file
#
# Header section
%{
import java.io.*;
public class wc
{
    int word_count = 0;
    public static void main(String argv[]) throws IOException
    {
        wc myLexer = new wc();
        myLexer.init(System.in);
        myLexer.jax_next_token();
        System.out.println(myLexer.word_count + " words");
    }
%}
# A word is assumed to be any sequence of characters
# that is not a blank, tab or newline.
/[^\_\n\t]+/ # The regular expression to match a word
%{ word_count++; %} # And its associated action.
; # don't forget the trailing semicolon
# Trailing section
%{
}
%}
A jax specification file has three parts: the header, the
regular expressions to be matched, and the trailer. The
header and trailer are enclosed within %{ .. %}
and are reproduced in the output file. Any actions associated with a
regular expression are specified the same way. Jax
processes this file and generates a Java file containing a
method, jax_next_token(), which starts
the matching process. Before calling it, you must prime the
lexer by calling the init()
method with an InputStream.
To generate the scanner, first run jax on the specification file, and then compile the generated Java file. The same example is provided in the distribution. If you are in the root of the distribution, a session might look like this.
% java sbktech.tools.jax.driver -lexFile wc.java examples/wc.lex
% javac wc.java
% java wc < wc.java
677 words
% wc wc.java
255 677 6723 wc.java
%
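For comparison, the counting rule in the specification above can be sketched by hand in plain Java. This is not the code jax generates; it is a minimal illustration that uses java.util.regex, and the class name WordCountSketch and the method countWords are made up for this example. One assumption to note: Matcher.find() silently skips characters that match nothing, which may differ from how a generated scanner treats unmatched input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordCountSketch {
    // Same idea as the jax pattern /[^\_\n\t]+/: one or more characters
    // that are not a blank, a newline, or a tab.
    private static final Pattern WORD = Pattern.compile("[^ \\n\\t]+");

    // Hypothetical helper: counts the words in a piece of text.
    public static int countWords(CharSequence text) {
        Matcher m = WORD.matcher(text);
        int count = 0;
        while (m.find()) {
            count++; // plays the role of the %{ word_count++; %} action
        }
        return count;
    }

    public static void main(String[] args) {
        // prints "4 words"
        System.out.println(countWords("one two\tthree\nfour") + " words");
    }
}
```

The pattern does the same job as the single rule in wc.lex; everything else in the specification (header, trailer, init()) is scaffolding around that rule.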
Let's take a closer look at the regular expression itself. The first
thing to note is that, unlike flex (and like perl), regular expressions are
delimited by slash (/) characters. Backslashes are
used to escape any special characters. White space in patterns is not
significant, so /abc/ is the same as
/a b   c/. To represent a blank, use
\_. There are only a few more surprises in the syntax,
which is rather like lex, except that jax does not provide the ^,
$, or a/b operators for context-sensitive
matching.
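Readers who know java.util.regex may find a loose analogy helpful here. The sketch below is not jax syntax: it uses the standard Pattern.COMMENTS flag, under which white space in a pattern is ignored, so a literal blank has to be written with an escape (here \x20, where jax would use \_). The class name PatternWhitespaceSketch is invented for the example.

```java
import java.util.regex.Pattern;

public class PatternWhitespaceSketch {
    // With Pattern.COMMENTS, white space in the pattern is ignored, much as
    // jax treats /abc/ and /a b   c/ as the same expression.
    public static final Pattern SPACED =
        Pattern.compile("a b   c", Pattern.COMMENTS);

    // A literal blank must be escaped; \x20 stands in for jax's \_ here.
    public static final Pattern WITH_BLANK =
        Pattern.compile("a \\x20 b", Pattern.COMMENTS);

    public static void main(String[] args) {
        System.out.println(SPACED.matcher("abc").matches());     // true
        System.out.println(SPACED.matcher("a b c").matches());   // false
        System.out.println(WITH_BLANK.matcher("a b").matches()); // true
    }
}
```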
The second example splits an HTML file into its comments and its remaining contents. The definition from the official comment specification says:
A comment declaration consists of `<!' followed by zero or more comments followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'. In a comment declaration, white space is allowed after each comment, but not before the first comment.
Here is one way to handle this through a jax specification.
/ <!
( -- ( [^\-] | -[^\-] )* -- [\_\r\n\t]* )*
> /
%{ htmlComments.append(jax_text()); %} ;
# Match the rest quickly
/[^<]+/
%{ htmlContents.append(jax_text()); %} ;
/</
%{ htmlContents.append('<'); %} ;
Only the interesting parts of the specification are shown.
The complete program is also present in the distribution. If you are
in the root of the distribution, here is how you might compile and
run it.
% java sbktech.tools.jax.driver -lexFile htmlsplit.java examples/htmlsplit.lex
% javac htmlsplit.java
% java htmlsplit examples/browbust.html
Html
====
<html><head><title>Examples of comment processing bugs</title></head><body>
[...]
In correct browsers, this will be the last sentence on this page.<p>
<p> <p> <p> <p> <p> <p> <p>
</body></html>
Comments
========
<!-- Your browser doesn't handle comments that cross a line boundary-->
[...]
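To make the comment rule concrete, here is a hand-written Java sketch of the same three rules using java.util.regex. It is not the generated htmlsplit.java: the class name HtmlSplitSketch and the method split are hypothetical, and the regular expression is a direct transliteration of the jax pattern under the assumption that \_ means a blank. Note also that java.util.regex tries alternatives left to right, whereas a lex-style scanner would normally pick the longest match; for these three rules the two strategies give the same result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSplitSketch {
    private static final Pattern TOKEN = Pattern.compile(
        // group 1: a comment declaration, <! (-- text -- whitespace*)* >
        "(<!(?:--(?:[^-]|-[^-])*--[ \\r\\n\\t]*)*>)"
        + "|[^<]+"   // a run of ordinary text, matched quickly
        + "|<");     // a '<' that does not start a comment declaration

    // Returns { contents, comments } for the given HTML text.
    public static String[] split(String html) {
        StringBuilder contents = new StringBuilder();
        StringBuilder comments = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                comments.append(m.group(1)); // like htmlComments.append(jax_text())
            } else {
                contents.append(m.group());  // like htmlContents.append(...)
            }
        }
        return new String[] { contents.toString(), comments.toString() };
    }

    public static void main(String[] args) {
        String[] parts = split("<p>a<!-- x - y --><b>");
        System.out.println(parts[0]); // <p>a<b>
        System.out.println(parts[1]); // <!-- x - y -->
    }
}
```

The inner alternation ( [^-] | -[^-] ) is what lets a single dash appear inside a comment while still stopping at the closing `--'.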
KB Sriram
Comments, bug reports: kbs@sbktech.org
Revised: Wed Jul 24 08:01:13 1996
URL: http://www.sbktech.org/jax-ex1.html